Import packages

library(pastecs)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
## 
##     first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(psych)
## Warning: package 'psych' was built under R version 3.5.2
library(purrr)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
##   dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
##   Referenced from: /Library/Frameworks/R.framework/Versions/3.5/Resources/modules/R_X11.so
##   Reason: image not found
## Could not load tcltk.  Will use slower R code instead.
## Loading required package: RSQLite
## Warning: package 'RSQLite' was built under R version 3.5.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(igraph)
## Warning: package 'igraph' was built under R version 3.5.2
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(MatchIt)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(tableone)
## Warning: package 'tableone' was built under R version 3.5.2

Import data

df <- read.csv('HighNote Data Midterm.csv')
head(df)
##   ID age male friend_cnt avg_friend_age avg_friend_male friend_country_cnt
## 1  1  22    0          8       22.57143       0.4285714                  1
## 2  2  35    0          2       28.00000       1.0000000                  2
## 3  3  27    1          2       23.00000       1.0000000                  1
## 4  4  21    0         28       22.94737       0.5000000                  7
## 5  5  24    0         65       22.28302       0.9137931                  9
## 6  6  21    1         12       25.00000       0.7777778                  1
##   subscriber_friend_cnt songsListened lovedTracks posts playlists shouts
## 1                     0          9687         194     0         1      8
## 2                     0             0           0     0         0      0
## 3                     0           508           0     0         1      2
## 4                     1          1357          32     0         0      1
## 5                     0         89984          20     2         0     81
## 6                     0        124547          10     0         1      2
##   adopter tenure good_country
## 1       0     59            1
## 2       0     35            0
## 3       0     42            0
## 4       0     25            0
## 5       0     67            0
## 6       0     53            1

#1. Summary statistics

describeBy(df, df$adopter)
## 
##  Descriptive statistics by group 
## group: 0
##                       vars     n     mean       sd   median  trimmed
## ID                       1 40300 20150.50 11633.75 20150.50 20150.50
## age                      2 40300    23.95     6.37    23.00    23.09
## male                     3 40300     0.62     0.48     1.00     0.65
## friend_cnt               4 40300    18.49    57.48     7.00    10.28
## avg_friend_age           5 40300    24.01     5.10    23.00    23.40
## avg_friend_male          6 40300     0.62     0.32     0.67     0.65
## friend_country_cnt       7 40300     3.96     5.76     2.00     2.66
## subscriber_friend_cnt    8 40300     0.42     2.42     0.00     0.13
## songsListened            9 40300 17589.44 28416.02  7440.00 11817.64
## lovedTracks             10 40300    86.82   263.58    14.00    36.35
## posts                   11 40300     5.29   104.31     0.00     0.23
## playlists               12 40300     0.55     1.07     0.00     0.45
## shouts                  13 40300    29.97   150.69     4.00     8.84
## adopter                 14 40300     0.00     0.00     0.00     0.00
## tenure                  15 40300    43.81    19.79    44.00    43.72
## good_country            16 40300     0.36     0.48     0.00     0.32
##                            mad min     max   range  skew kurtosis     se
## ID                    14937.19   1   40300   40299  0.00    -1.20  57.95
## age                       4.45   8      79      71  1.97     6.80   0.03
## male                      0.00   0       1       1 -0.50    -1.75   0.00
## friend_cnt                7.41   1    4957    4956 32.67  2087.42   0.29
## avg_friend_age            3.95   8      77      69  1.84     7.15   0.03
## avg_friend_male           0.35   0       1       1 -0.52    -0.72   0.00
## friend_country_cnt        1.48   0     129     129  4.74    38.29   0.03
## subscriber_friend_cnt     0.00   0     309     309 72.19  8024.62   0.01
## songsListened         10576.87   0 1000000 1000000  6.05   105.85 141.55
## lovedTracks              20.76   0   12522   12522 13.12   335.93   1.31
## posts                     0.00   0   12309   12309 73.92  7005.34   0.52
## playlists                 0.00   0      98      98 28.21  1945.28   0.01
## shouts                    4.45   0    7736    7736 22.53   779.12   0.75
## adopter                   0.00   0       0       0   NaN      NaN   0.00
## tenure                   22.24   1     111     110  0.05    -0.70   0.10
## good_country              0.00   0       1       1  0.59    -1.65   0.00
## -------------------------------------------------------- 
## group: 1
##                       vars    n     mean       sd   median  trimmed
## ID                       1 3527 42064.00  1018.30 42064.00 42064.00
## age                      2 3527    25.98     6.84    24.00    25.05
## male                     3 3527     0.73     0.44     1.00     0.79
## friend_cnt               4 3527    39.73   117.27    16.00    23.69
## avg_friend_age           5 3527    25.44     5.21    24.36    24.83
## avg_friend_male          6 3527     0.64     0.25     0.67     0.65
## friend_country_cnt       7 3527     7.19     8.86     4.00     5.36
## subscriber_friend_cnt    8 3527     1.64     5.85     0.00     0.84
## songsListened            9 3527 33758.04 43592.73 20908.00 25811.69
## lovedTracks             10 3527   264.34   491.43   108.00   161.68
## posts                   11 3527    21.20   221.99     0.00     1.44
## playlists               12 3527     0.90     2.56     1.00     0.59
## shouts                  13 3527    99.44  1156.07     9.00    23.89
## adopter                 14 3527     1.00     0.00     1.00     1.00
## tenure                  15 3527    45.58    20.04    46.00    45.60
## good_country            16 3527     0.29     0.45     0.00     0.23
##                            mad   min    max  range  skew kurtosis     se
## ID                     1307.65 40301  43827   3526  0.00    -1.20  17.15
## age                       4.45     8     73     65  1.68     4.39   0.12
## male                      0.00     0      1      1 -1.03    -0.94   0.01
## friend_cnt               17.79     1   5089   5088 26.04  1013.79   1.97
## avg_friend_age            3.91    12     62     50  1.68     5.05   0.09
## avg_friend_male           0.25     0      1      1 -0.54    -0.05   0.00
## friend_country_cnt        4.45     0    136    136  3.61    24.53   0.15
## subscriber_friend_cnt     0.00     0    287    287 34.05  1609.52   0.10
## songsListened         23276.82     0 817290 817290  4.71    46.64 734.03
## lovedTracks             140.85     0  10220  10220  6.52    80.96   8.27
## posts                     0.00     0   8506   8506 26.52   852.38   3.74
## playlists                 1.48     0    118    118 28.84  1244.31   0.04
## shouts                   11.86     0  65872  65872 52.52  2969.09  19.47
## adopter                   0.00     1      1      0   NaN      NaN   0.00
## tenure                   20.76     0    111    111  0.02    -0.62   0.34
## good_country              0.00     0      1      1  0.94    -1.12   0.01
free <- df[df$adopter == 0,]
premium <- df[df$adopter == 1,]

#stat.desc(free)
#stat.desc(premium)

#t-test
output = NULL
for (i in 2:16){
  if(i == 14) next()
  name = names(free[i])
  p_value = t.test(free[i], premium[i])$p.value
  output <- rbind(output, data.frame(name, p_value))
}
print(output)
##                     name      p_value
## 1                    age 1.207042e-62
## 2                   male 1.388253e-41
## 3             friend_cnt 4.360547e-26
## 4         avg_friend_age 9.940507e-54
## 5        avg_friend_male 9.096630e-06
## 6     friend_country_cnt 6.523070e-95
## 7  subscriber_friend_cnt 4.994062e-34
## 8          songsListened 6.334710e-98
## 9            lovedTracks 3.839467e-94
## 10                 posts 2.556983e-05
## 11             playlists 8.619207e-16
## 12                shouts 3.673566e-04
## 13                tenure 4.767641e-07
## 14          good_country 1.941292e-18

From the above descriptive statistics, there are 40,300 free users (group 0/adopter = 0) and 3527 premium users (group 1/adopter = 1). For premium users, almost all its variables have higher mean values than that of free users. For instance, the average age of premium users is 25.98, older than that of free users, 23.95. It can be inferred that younger users would like to choose free service on High Note platform due to their income or other relevant reasons. In terms of varaibles of avg_friend_cnt and singListened, for instance, the premium users have more average friend count than free users. Also, they listened more songs (songListened variable), which is 33,758 on average, than 17,589 songs listened by free users. In a nutshell, the overall performance of all variables for premium users is better than free users. It can be assumed that premium users, as opposed to free users, are more likely to be loyal on the High Note music platform.

Doing t-test on all variables for adopters (premium users) and non-adopters (free users), p-values of all variables are statistically significant. It means that there are huge difference between two groups, in terms of all variables.

#2. Data Visualization (i) demographics

#age
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:purrr':
## 
##     compact
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
df$Adopter <- factor(df$adopter, levels = c(0, 1))
means <- ddply(df, "Adopter", summarise, age.mean = mean(age))
means
##   Adopter age.mean
## 1       0 23.94844
## 2       1 25.97987
ggplot(df, aes(x = age, group = Adopter, fill = Adopter)) +
  geom_density(alpha = .3) + 
  geom_vline(data = means, aes(xintercept = age.mean, colour = Adopter),
             linetype = "longdash", size=1)

#male

df$Male <- factor(df$male, levels = c(0, 1))
df %>% group_by(Adopter, Male) %>% tally() %>%
  ggplot(aes(Male, n, group = Adopter, fill = Adopter)) + geom_bar(position = "dodge", stat = "identity") + labs(x = 'male', y = 'count', title = 'Male Count by Adopter') 

#good country
df$Good_country <- factor(df$good_country, levels = c(0, 1))

df %>% group_by(Adopter, Good_country) %>% tally() %>%
  ggplot(aes(Good_country, n, group = as.factor(Adopter), fill = Adopter)) + geom_bar(position = "dodge", stat = "identity") + labs(x = 'good_country', y = 'count', title = 'Count of Good Country by Adopter') 

ggplot(df, aes(x=Adopter, y = Good_country))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter") 
## Warning: Ignoring unknown parameters: fun.y
## No summary function supplied, defaulting to `mean_se()`

After descriptive statistics, I checked the difference of all variables through visualized plots. First, I looked to the visualization of demographic variables, including age, male, and good country. The age for two types of users are all right skewed, yet the free users are younger than the premium users, in general. As for the number of males, the premium users have more male than free users, while for the users from good country, it seems that free users come from good country more than premium users do.

  1. peer influence
#mean value of peer influence variables
col <- c(colnames(df)[4:8], 'adopter')
peer_mean<- df %>%
  group_by(adopter) %>%
  select(one_of(col)) %>%
  summarise_all(funs(mean)) 
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
## 
## # Before:
## funs(name = f(.)
## 
## # After: 
## list(name = ~f(.))
## This warning is displayed once per session.
peer_mean
## # A tibble: 2 x 6
##   adopter friend_cnt avg_friend_age avg_friend_male friend_country_…
##     <int>      <dbl>          <dbl>           <dbl>            <dbl>
## 1       0       18.5           24.0           0.617             3.96
## 2       1       39.7           25.4           0.637             7.19
## # … with 1 more variable: subscriber_friend_cnt <dbl>
#distribution
ggpairs(df[4:8])

#plot mean value to see difference in each variable
p1<- ggplot(df, aes(x=Adopter, y = friend_cnt, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter") 
## Warning: Ignoring unknown parameters: fun.y
p2<- ggplot(df, aes(x=Adopter, y = avg_friend_age, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p3<-ggplot(df, aes(x=Adopter, y = avg_friend_male, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p4<-ggplot(df, aes(x=Adopter, y = friend_country_cnt, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p5<-ggplot(df, aes(x=Adopter, y = subscriber_friend_cnt, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
grid.arrange(p1, p2,p3,p4,p5, nrow = 3)
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

In terms of variables related to peer influence, it includes friend_cnt, avg_friend_age, avg_friend_male, friend_country_cnt, subscriber_friend_cnt. Below is the barplot and distribution of mean values of all variables. They all have skewed distributions, based on the distribution matrix. In overall, the premium users have higher mean values of peer influence than free users have. That is to say premium users are more likely to be affected by their friends and also interact more with their friends as well.

  1. user engagement
df %>%
  group_by(adopter) %>%
  select(one_of(colnames(df)[9:15])) %>%
  summarise_all(funs(mean)) 
## # A tibble: 2 x 7
##   adopter songsListened lovedTracks posts playlists shouts tenure
##     <int>         <dbl>       <dbl> <dbl>     <dbl>  <dbl>  <dbl>
## 1       0        17589.        86.8  5.29     0.549   30.0   43.8
## 2       1        33758.       264.  21.2      0.901   99.4   45.6
#distribution
ggpairs(df[9:13])

#plot mean value to see difference in each variable
p1<- ggplot(df, aes(x=Adopter, y = songsListened, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter") 
## Warning: Ignoring unknown parameters: fun.y
p2<- ggplot(df, aes(x=Adopter, y = lovedTracks, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p3<-ggplot(df, aes(x=Adopter, y = posts, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p4<-ggplot(df, aes(x=Adopter, y = playlists, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p5<-ggplot(df, aes(x=Adopter, y = shouts, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p6<-ggplot(df, aes(x=Adopter, y = tenure, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
grid.arrange(p1, p2,p3,p4,p5,p6, nrow = 3)
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

For the user engagement variables, it includes songListened, lovedTracks, posts, playlists, shouts, and tenure. T premium users have higher user engagement than free users, in terms of all mean values of these variables. That is to say, the premium users are more active on the High Note music platform.

#3. Propensity Score Matching (PSM) Use PSM to test whether having subscriber friends affects the likelihood of becoming an adopter (i.e. fee customer).

#split treatment and control group
df <- read.csv('HighNote Data Midterm.csv')
df <- mutate(df, treatment = ifelse(subscriber_friend_cnt>=1, 1, 0))

#run t-test before PSM
with(df, t.test(adopter ~ treatment)) #c-t: 0.052 - 0.178
## 
##  Welch Two Sample t-test
## 
## data:  adopter by treatment
## t = -30.961, df = 11815, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1330281 -0.1171869
## sample estimates:
## mean in group 0 mean in group 1 
##      0.05243501      0.17754250

Before doing propensity score matching, first, I split treatment group and control group based on the definition of whether users that have one or more subscriber friends. Then, I did t-test to check if treatment group and control group have significant difference on the behavior of choosing free and premium service (adopter). It ended up getting the p-value with statistically significant. So, two groups are different between adopter variable. The mean value of adopter for treatment group is 0.178 and that of control group is 0.052 or so. Therefore, it is necessary to do propensity score matching to make two groups more balance.

#logistic regression for PSM To make sure all the other variables are significant to control group and treatment group, I did logistic regression for all variables, excluding variables related to treatment and adopter. According to the distribution plots in the second session, we knew the distributions of all continuous variables are skewed. Thus, I took log transformation before doing logistic regression. The result shows that, if the alpha of model is 0.05, then all variables are significantly correlated to the dependent variable - the treatment. AIC is 31493.

logit<- glm(treatment ~log(age) + male +log(friend_cnt+1) + log(avg_friend_age+1)  + log(avg_friend_male+1) +
                       log(friend_country_cnt+1) + log(songsListened+1) + log(lovedTracks+1) + log(posts+1) +
                       log(playlists+1) + log(shouts+1) + log(tenure+1) + good_country, family = binomial(), data = df)
summary(logit)
## 
## Call:
## glm(formula = treatment ~ log(age) + male + log(friend_cnt + 
##     1) + log(avg_friend_age + 1) + log(avg_friend_male + 1) + 
##     log(friend_country_cnt + 1) + log(songsListened + 1) + log(lovedTracks + 
##     1) + log(posts + 1) + log(playlists + 1) + log(shouts + 1) + 
##     log(tenure + 1) + good_country, family = binomial(), data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5576  -0.5682  -0.2997  -0.1170   3.3903  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -17.758481   0.301395 -58.921  < 2e-16 ***
## log(age)                      0.647953   0.095402   6.792 1.11e-11 ***
## male                          0.078393   0.031386   2.498   0.0125 *  
## log(friend_cnt + 1)           1.078788   0.027146  39.740  < 2e-16 ***
## log(avg_friend_age + 1)       3.486321   0.125283  27.828  < 2e-16 ***
## log(avg_friend_male + 1)      0.381212   0.090651   4.205 2.61e-05 ***
## log(friend_country_cnt + 1)   0.560058   0.032010  17.496  < 2e-16 ***
## log(songsListened + 1)        0.051766   0.009154   5.655 1.56e-08 ***
## log(lovedTracks + 1)          0.084539   0.007936  10.652  < 2e-16 ***
## log(posts + 1)                0.079169   0.014703   5.385 7.26e-08 ***
## log(playlists + 1)           -0.152139   0.035585  -4.275 1.91e-05 ***
## log(shouts + 1)              -0.028970   0.013678  -2.118   0.0342 *  
## log(tenure + 1)              -0.367109   0.031064 -11.818  < 2e-16 ***
## good_country                  0.059487   0.030370   1.959   0.0501 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 46640  on 43826  degrees of freedom
## Residual deviance: 31465  on 43813  degrees of freedom
## AIC: 31493
## 
## Number of Fisher Scoring iterations: 6

After logistic regression, we use the model to predict propensity scores of treatment and control group.

#predicted propensity score
prs_df <- data.frame(pr_score = predict(logit, type = "response"),
                     treatment = logit$model$treatment)
head(prs_df)
##     pr_score treatment
## 1 0.07337272         0
## 2 0.04897846         0
## 3 0.02140195         0
## 4 0.42334070         1
## 5 0.64329312         0
## 6 0.15366186         0

I plotted the propensity score by two groups, which refers to the following distribution plots. It shows that the treatment group has consistent number of users among different levels of propensity scores. Yet the control group has right skewed distribution. It has more people with lower propensity score, who are less likely to turning into adopters, that is, premium users.

#plot histogran of treatment status to examine the region of common support
labs <- paste("Actual treatment type:", c("Treatment w/ subscriber friends", "Control w/n subscriber friends"))
prs_df %>%
  mutate(treatment = ifelse(treatment == 1, labs[1], labs[2])) %>%
  ggplot(aes(x = pr_score)) +
  geom_histogram(color = "white") +
  facet_wrap(~treatment) +
  xlab("Probability of getting subscriber friends") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#create a prematching table
xvars <- colnames(df[!colnames(df) %in% c('ID', 'subscriber_friend_cnt', 'adopter', 'treatment')])
table1 <- CreateTableOne(vars = xvars, strata = "treatment", data = df, test = FALSE)
print(table1, smd = TRUE)
##                                 Stratified by treatment
##                                  0                   1                  
##   n                                 34004                9823           
##   age (mean (SD))                   23.75 (6.22)        25.37 (6.97)    
##   male (mean (SD))                   0.63 (0.48)         0.64 (0.48)    
##   friend_cnt (mean (SD))            10.43 (15.28)       54.02 (127.91)  
##   avg_friend_age (mean (SD))        23.76 (5.06)        25.39 (5.17)    
##   avg_friend_male (mean (SD))        0.61 (0.33)         0.64 (0.23)    
##   friend_country_cnt (mean (SD))     2.73 (3.10)         9.39 (10.01)   
##   songsListened (mean (SD))      14602.22 (23214.29) 33735.64 (43952.34)
##   lovedTracks (mean (SD))           65.21 (181.48)     225.36 (498.23)  
##   posts (mean (SD))                  2.54 (33.79)       20.52 (241.27)  
##   playlists (mean (SD))              0.53 (0.97)         0.74 (1.96)    
##   shouts (mean (SD))                16.42 (79.74)      101.82 (739.51)  
##   tenure (mean (SD))                43.20 (19.72)       46.55 (19.92)   
##   good_country (mean (SD))           0.35 (0.48)         0.34 (0.47)    
##                                 Stratified by treatment
##                                  SMD   
##   n                                    
##   age (mean (SD))                 0.246
##   male (mean (SD))                0.015
##   friend_cnt (mean (SD))          0.479
##   avg_friend_age (mean (SD))      0.319
##   avg_friend_male (mean (SD))     0.079
##   friend_country_cnt (mean (SD))  0.899
##   songsListened (mean (SD))       0.544
##   lovedTracks (mean (SD))         0.427
##   posts (mean (SD))               0.104
##   playlists (mean (SD))           0.139
##   shouts (mean (SD))              0.162
##   tenure (mean (SD))              0.169
##   good_country (mean (SD))        0.024

According to the pre-matching table as below, some variables between treatment group and control group are significantly different, in terms of mean difference. (*When SMD of a variable is larger than 0.1, we can say that this variable between two groups are different.)

Propensity Score Matching – Method 1: nearest neighbor matching After finding differe in the treatment status, use MatchIt to find pairs of observations that have very similar propensity scores

With predicted propensity score from logistic regression, I first used the nearest method to match treatment and control samples. The nearest method is to pair a treated subject to a control who is close in terms of its distance in covariate space. The result shows that 9823 samples got matched. For all data without matching, the distances (= propensity score) between treatment and control are 0.4982 and 0.1450. However, the distances of matched data between treatment and control are 0,4982 and 0.3447. The mean difference of distance of the matched data is 0.1535, lower than the distance of all data, 0.3532. There are 56.54% balance improvement of mean different of distance.

#checl null value and omit
sum(is.na(df)) #no null value
## [1] 0
#matchit
mod_match <- matchit(treatment ~ log(age) + male +log(friend_cnt+1) + log(avg_friend_age+1)  + log(avg_friend_male+1) + log(friend_country_cnt+1) + log(songsListened+1) + log(lovedTracks+1) + log(posts+1) + log(playlists+1) + log(shouts+1) + log(tenure+1) + good_country, method = 'nearest', data = df)
                   
summary(mod_match, covariates = T)
## 
## Call:
## matchit(formula = treatment ~ log(age) + male + log(friend_cnt + 
##     1) + log(avg_friend_age + 1) + log(avg_friend_male + 1) + 
##     log(friend_country_cnt + 1) + log(songsListened + 1) + log(lovedTracks + 
##     1) + log(posts + 1) + log(playlists + 1) + log(shouts + 1) + 
##     log(tenure + 1) + good_country, data = df, method = "nearest")
## 
## Summary of balance for all data:
##                             Means Treated Means Control SD Control
## distance                           0.4982        0.1450     0.1682
## log(age)                           3.2014        3.1389     0.2312
## male                               0.6363        0.6288     0.4831
## log(friend_cnt + 1)                3.3261        1.9362     0.9467
## log(avg_friend_age + 1)            3.2562        3.1912     0.1847
## log(avg_friend_male + 1)           0.4814        0.4540     0.2270
## log(friend_country_cnt + 1)        1.9993        1.1226     0.5558
## log(songsListened + 1)             9.6020        7.9441     2.7002
## log(lovedTracks + 1)               3.9582        2.4466     1.9781
## log(posts + 1)                     0.8391        0.2865     0.7620
## log(playlists + 1)                 0.4130        0.3364     0.3943
## log(shouts + 1)                    3.0357        1.7159     1.2392
## log(tenure + 1)                    3.7453        3.6546     0.5750
## good_country                       0.3433        0.3547     0.4784
##                             Mean Diff eQQ Med eQQ Mean eQQ Max
## distance                       0.3532  0.3898   0.3532  0.5491
## log(age)                       0.0625  0.0572   0.0628  0.1823
## male                           0.0074  0.0000   0.0074  1.0000
## log(friend_cnt + 1)            1.3899  1.4553   1.3899  2.9793
## log(avg_friend_age + 1)        0.0649  0.0654   0.0651  0.4974
## log(avg_friend_male + 1)       0.0274  0.0465   0.0662  0.3102
## log(friend_country_cnt + 1)    0.8767  1.0116   0.8766  1.2993
## log(songsListened + 1)         1.6579  1.2880   1.6583  6.0283
## log(lovedTracks + 1)           1.5116  1.5315   1.5115  2.8332
## log(posts + 1)                 0.5526  0.0000   0.5525  2.3214
## log(playlists + 1)             0.0766  0.0000   0.0765  1.0986
## log(shouts + 1)                1.3198  1.5041   1.3197  2.2849
## log(tenure + 1)                0.0907  0.0770   0.0909  1.6094
## good_country                  -0.0114  0.0000   0.0114  1.0000
## 
## 
## Summary of balance for matched data:
##                             Means Treated Means Control SD Control
## distance                           0.4982        0.3447     0.1859
## log(age)                           3.2014        3.1883     0.2469
## male                               0.6363        0.6426     0.4793
## log(friend_cnt + 1)                3.3261        2.8534     0.8519
## log(avg_friend_age + 1)            3.2562        3.2510     0.2004
## log(avg_friend_male + 1)           0.4814        0.4810     0.1552
## log(friend_country_cnt + 1)        1.9993        1.6386     0.6274
## log(songsListened + 1)             9.6020        9.3647     1.7585
## log(lovedTracks + 1)               3.9582        3.5292     1.9237
## log(posts + 1)                     0.8391        0.5846     1.0732
## log(playlists + 1)                 0.4130        0.3892     0.4143
## log(shouts + 1)                    3.0357        2.5247     1.4340
## log(tenure + 1)                    3.7453        3.7322     0.5493
## good_country                       0.3433        0.3488     0.4766
##                             Mean Diff eQQ Med eQQ Mean eQQ Max
## distance                       0.1535  0.1730   0.1535  0.2964
## log(age)                       0.0131  0.0000   0.0148  0.3185
## male                          -0.0063  0.0000   0.0063  1.0000
## log(friend_cnt + 1)            0.4727  0.4055   0.4727  2.8447
## log(avg_friend_age + 1)        0.0051  0.0185   0.0202  0.2054
## log(avg_friend_male + 1)       0.0004  0.0061   0.0070  0.0496
## log(friend_country_cnt + 1)    0.3607  0.4055   0.3607  1.2040
## log(songsListened + 1)         0.2373  0.2152   0.2377  1.0986
## log(lovedTracks + 1)           0.4290  0.4754   0.4290  0.8702
## log(posts + 1)                 0.2545  0.0000   0.2545  1.6584
## log(playlists + 1)             0.0238  0.0000   0.0238  1.6011
## log(shouts + 1)                0.5110  0.5108   0.5110  2.2849
## log(tenure + 1)                0.0131  0.0000   0.0132  1.6094
## good_country                  -0.0055  0.0000   0.0055  1.0000
## 
## Percent Balance Improvement:
##                             Mean Diff.  eQQ Med eQQ Mean  eQQ Max
## distance                       56.5380  55.6227  56.5362  46.0241
## log(age)                       78.9816 100.0000  76.4853 -74.6660
## male                           14.9828   0.0000  15.0685   0.0000
## log(friend_cnt + 1)            65.9902  72.1385  65.9896   4.5196
## log(avg_friend_age + 1)        92.0737  71.7658  68.9221  58.6966
## log(avg_friend_male + 1)       98.4996  86.9324  89.4496  84.0090
## log(friend_country_cnt + 1)    58.8518  59.9185  58.8500   7.3356
## log(songsListened + 1)         85.6845  83.2950  85.6681  81.7757
## log(lovedTracks + 1)           71.6172  68.9565  71.6162  69.2871
## log(posts + 1)                 53.9493   0.0000  53.9396  28.5570
## log(playlists + 1)             68.9717   0.0000  68.9302 -45.7356
## log(shouts + 1)                61.2805  66.0373  61.2778   0.0000
## log(tenure + 1)                85.5660 100.0000  85.4254   0.0000
## good_country                   51.8523   0.0000  51.7857   0.0000
## 
## Sample sizes:
##           Control Treated
## All         34004    9823
## Matched      9823    9823
## Unmatched   24181       0
## Discarded       0       0
#plot(mod_match)

#create df to save match data
dta_m <- match.data(mod_match)
dim(dta_m)
## [1] 19646    19
#create a TableOne to see Standardized Mean Difference
match_table <- CreateTableOne(vars = xvars, strata = "treatment", data = dta_m, test = FALSE)
print(match_table, smd = TRUE)
##                                 Stratified by treatment
##                                  0                   1                  
##   n                                  9823                9823           
##   age (mean (SD))                   25.05 (6.99)        25.37 (6.97)    
##   male (mean (SD))                   0.64 (0.48)         0.64 (0.48)    
##   friend_cnt (mean (SD))            23.37 (22.62)       54.02 (127.91)  
##   avg_friend_age (mean (SD))        25.38 (6.04)        25.39 (5.17)    
##   avg_friend_male (mean (SD))        0.64 (0.24)         0.64 (0.23)    
##   friend_country_cnt (mean (SD))     5.32 (4.52)         9.39 (10.01)   
##   songsListened (mean (SD))      26514.56 (30768.27) 33735.64 (43952.34)
##   lovedTracks (mean (SD))          132.15 (276.54)     225.36 (498.23)  
##   posts (mean (SD))                  6.29 (57.19)       20.52 (241.27)  
##   playlists (mean (SD))              0.63 (0.94)         0.74 (1.96)    
##   shouts (mean (SD))                39.20 (138.80)     101.82 (739.51)  
##   tenure (mean (SD))                46.14 (19.84)       46.55 (19.92)   
##   good_country (mean (SD))           0.35 (0.48)         0.34 (0.47)    
##                                 Stratified by treatment
##                                  SMD   
##   n                                    
##   age (mean (SD))                 0.046
##   male (mean (SD))                0.013
##   friend_cnt (mean (SD))          0.334
##   avg_friend_age (mean (SD))      0.001
##   avg_friend_male (mean (SD))     0.003
##   friend_country_cnt (mean (SD))  0.524
##   songsListened (mean (SD))       0.190
##   lovedTracks (mean (SD))         0.231
##   posts (mean (SD))               0.081
##   playlists (mean (SD))           0.077
##   shouts (mean (SD))              0.118
##   tenure (mean (SD))              0.020
##   good_country (mean (SD))        0.012
#alternative way to see mean difference
df_colnames <- names(dta_m)[c(-1, -8, -14, -17, -18, -19)]
dta_m %>%
  group_by(treatment) %>%
  select(one_of(df_colnames)) %>%
  summarise_all(funs(mean))
## Adding missing grouping variables: `treatment`
## # A tibble: 2 x 14
##   treatment   age  male friend_cnt avg_friend_age avg_friend_male
##       <dbl> <dbl> <dbl>      <dbl>          <dbl>           <dbl>
## 1         0  25.1 0.643       23.4           25.4           0.636
## 2         1  25.4 0.636       54.0           25.4           0.636
## # … with 8 more variables: friend_country_cnt <dbl>, songsListened <dbl>,
## #   lovedTracks <dbl>, posts <dbl>, playlists <dbl>, shouts <dbl>,
## #   tenure <dbl>, good_country <dbl>
#t=test
output = NULL
for (i in 1:13){
  name = df_colnames[i]
  p_value = t.test(dta_m[, df_colnames[i]] ~ dta_m$treatment)$p.value
  output <- rbind(output, data.frame(name, p_value))
}
print(output) #all significant
##                  name       p_value
## 1                 age  1.199746e-03
## 2                male  3.569646e-01
## 3          friend_cnt 5.640113e-118
## 4      avg_friend_age  9.326247e-01
## 5     avg_friend_male  8.444775e-01
## 6  friend_country_cnt 9.248697e-282
## 7       songsListened  2.141575e-40
## 8         lovedTracks  1.266920e-58
## 9               posts  1.307063e-08
## 10          playlists  7.480149e-08
## 11             shouts  1.795939e-16
## 12             tenure  1.523959e-01
## 13       good_country  4.180329e-01
# Estimating treatment effects
with(dta_m, t.test(adopter ~ treatment))
## 
##  Welch Two Sample t-test
## 
## data:  adopter by treatment
## t = -17.858, df = 18260, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.09569061 -0.07676179
## sample estimates:
## mean in group 0 mean in group 1 
##       0.0913163       0.1775425

I still got significant difference of all variables between control group and treatment group after nearest method matching. So does the result of t-test between adopter and treatment. The p-value of t-test is significant. As a result, I chose another matching method, the subclass method, to see if it can balance observed covariates of two groups better.

Visual inspection to examine covariate balance in the matched sample. If the scatter points are located near the trend line, it means the matching is good. Yet the below plots show the nearest method still have improvement of well matching the treatment and control samples in all covariates.

fn_bal <- function(dta, variable) {
  dta$variable <- dta[, variable]
  support <- c(min(dta$variable), max(dta$variable))
  ggplot(dta, aes(x = distance, y = variable, color = treatment)) +
    geom_point(alpha = 0.2, size = 1.3) +
    geom_smooth(method = "loess", se = F) +
    xlab("Propensity score") +
    ylab(variable) +
    theme_bw() +
    ylim(support)
}


grid.arrange(
  fn_bal(dta_m, "age"),
  fn_bal(dta_m, "male") + theme(legend.position = "none"),
  fn_bal(dta_m, "friend_cnt"),
  fn_bal(dta_m, "avg_friend_age") + theme(legend.position = "none"),
  fn_bal(dta_m, "avg_friend_male"),
  fn_bal(dta_m, "friend_country_cnt") + theme(legend.position = "none"),
  fn_bal(dta_m, "subscriber_friend_cnt"),
  fn_bal(dta_m, "songsListened") + theme(legend.position = "none"),
  nrow = 4, widths = c(10, 8)
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

grid.arrange(
  fn_bal(dta_m, "lovedTracks"),
  fn_bal(dta_m, "posts") + theme(legend.position = "none"),
  fn_bal(dta_m, "playlists"),
  fn_bal(dta_m, "shouts") + theme(legend.position = "none"),
  fn_bal(dta_m, "adopter"),
  fn_bal(dta_m, "tenure") + theme(legend.position = "none"),
  fn_bal(dta_m, "good_country"),
  nrow = 4, widths = c(10, 8)
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing missing values (geom_smooth).
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Propensity Score Matching – Method 2: subclassification (subclass) The subclass method is to loosely put “similar’’ observations (both treatment and control) into the same “subclass”. In this method, the treatment and control would be split into 6 subclasses. I decided to select the subclasses with lower mean difference than the value got from the nearest neighbor method, which are subclass 2,3,4,5 and subclass 2, 4 having better performance. I ended up choosing subclass 2, it has pretty low mean difference 0.093 and the mean distances of treatment and control are 0.2532 and 0.2439.

For subclass 2, their treatment and control samples are 5026 and 1637. Looking to the overall performance of all subclasses, we got the mean distances of treatment and control 0.4982 and 0.4792, respectively. The mean difference is 0.0092. It improves balance of two groups by 94.6247%, in terms of their distance (propensity score).

Mean difference for all variables refers to the below table. All the standardized mean differences, SMD, are lower than 0.1. It means that there is no huge difference between treated and control for all variables.

#subclass matching approach using subclass method
mod_match2 <- matchit(treatment ~ log(age) + male +log(friend_cnt+1) + log(avg_friend_age+1)  + log(avg_friend_male+1) + log(friend_country_cnt+1) + log(songsListened+1) + log(lovedTracks+1) + log(posts+1) + log(playlists+1) + log(shouts+1) + log(tenure+1) + good_country, method = 'subclass', data = df)
                   
summary(mod_match2, covariates = T)
## 
## Call:
## matchit(formula = treatment ~ log(age) + male + log(friend_cnt + 
##     1) + log(avg_friend_age + 1) + log(avg_friend_male + 1) + 
##     log(friend_country_cnt + 1) + log(songsListened + 1) + log(lovedTracks + 
##     1) + log(posts + 1) + log(playlists + 1) + log(shouts + 1) + 
##     log(tenure + 1) + good_country, data = df, method = "subclass", 
##     sub.by = "treat")
## Summary of balance for all data:
##                             Means Treated Means Control Mean Diff eQQ Med
## distance                           0.4982        0.1450    0.3532  0.3898
## log(age)                           3.2014        3.1389    0.0625  0.0572
## male                               0.6363        0.6288    0.0074  0.0000
## log(friend_cnt + 1)                3.3261        1.9362    1.3899  1.4553
## log(avg_friend_age + 1)            3.2562        3.1912    0.0649  0.0654
## log(avg_friend_male + 1)           0.4814        0.4540    0.0274  0.0465
## log(friend_country_cnt + 1)        1.9993        1.1226    0.8767  1.0116
## log(songsListened + 1)             9.6020        7.9441    1.6579  1.2880
## log(lovedTracks + 1)               3.9582        2.4466    1.5116  1.5315
## log(posts + 1)                     0.8391        0.2865    0.5526  0.0000
## log(playlists + 1)                 0.4130        0.3364    0.0766  0.0000
## log(shouts + 1)                    3.0357        1.7159    1.3198  1.5041
## log(tenure + 1)                    3.7453        3.6546    0.0907  0.0770
## good_country                       0.3433        0.3547   -0.0114  0.0000
##                             eQQ Mean eQQ Max
## distance                      0.3532  0.5491
## log(age)                      0.0628  0.1823
## male                          0.0074  1.0000
## log(friend_cnt + 1)           1.3899  2.9793
## log(avg_friend_age + 1)       0.0651  0.4974
## log(avg_friend_male + 1)      0.0662  0.3102
## log(friend_country_cnt + 1)   0.8766  1.2993
## log(songsListened + 1)        1.6583  6.0283
## log(lovedTracks + 1)          1.5115  2.8332
## log(posts + 1)                0.5525  2.3214
## log(playlists + 1)            0.0765  1.0986
## log(shouts + 1)               1.3197  2.2849
## log(tenure + 1)               0.0909  1.6094
## good_country                  0.0114  1.0000
## 
## 
## Summary of balance by subclasses:
## , , Subclass 1
## 
##                             Means Treated Means Control Mean Diff eQQ Med
## distance                           0.0958        0.0590    0.0368  0.0413
## log(age)                           3.1476        3.1177    0.0299  0.0392
## male                               0.6512        0.6226    0.0286  0.0000
## log(friend_cnt + 1)                1.9106        1.5368    0.3738  0.4055
## log(avg_friend_age + 1)            3.2050        3.1660    0.0389  0.0396
## log(avg_friend_male + 1)           0.4690        0.4436    0.0254  0.0155
## log(friend_country_cnt + 1)        1.0560        0.9001    0.1559  0.0000
## log(songsListened + 1)             8.2132        7.3336    0.8796  0.6968
## log(lovedTracks + 1)               2.5337        1.9785    0.5552  0.6022
## log(posts + 1)                     0.2298        0.1579    0.0718  0.0000
## log(playlists + 1)                 0.3265        0.3153    0.0112  0.0000
## log(shouts + 1)                    1.6350        1.3627    0.2724  0.3365
## log(tenure + 1)                    3.6255        3.6195    0.0060  0.0000
## good_country                       0.3531        0.3566   -0.0035  0.0000
##                             eQQ Mean eQQ Max
## distance                      0.0367  0.0517
## log(age)                      0.0314  0.3090
## male                          0.0287  1.0000
## log(friend_cnt + 1)           0.3739  0.6931
## log(avg_friend_age + 1)       0.0396  0.4974
## log(avg_friend_male + 1)      0.0468  0.2877
## log(friend_country_cnt + 1)   0.1560  0.6931
## log(songsListened + 1)        0.8811  3.4340
## log(lovedTracks + 1)          0.5546  1.3863
## log(posts + 1)                0.0730  2.4556
## log(playlists + 1)            0.0138  2.1972
## log(shouts + 1)               0.2711  0.8548
## log(tenure + 1)               0.0127  0.9163
## good_country                  0.0037  1.0000
## 
## , , Subclass 2
## 
##                             Means Treated Means Control Mean Diff eQQ Med
## distance                           0.2532        0.2439    0.0093  0.0103
## log(age)                           3.1820        3.1865   -0.0045  0.0000
## male                               0.6402        0.6546   -0.0144  0.0000
## log(friend_cnt + 1)                2.6798        2.6449    0.0349  0.0000
## log(avg_friend_age + 1)            3.2496        3.2505   -0.0009  0.0147
## log(avg_friend_male + 1)           0.4769        0.4807   -0.0039  0.0026
## log(friend_country_cnt + 1)        1.4616        1.4356    0.0260  0.0000
## log(songsListened + 1)             9.3979        9.3021    0.0958  0.0589
## log(lovedTracks + 1)               3.3213        3.3103    0.0110  0.0198
## log(posts + 1)                     0.4710        0.4618    0.0092  0.0000
## log(playlists + 1)                 0.3764        0.3716    0.0048  0.0000
## log(shouts + 1)                    2.2785        2.2897   -0.0112  0.0000
## log(tenure + 1)                    3.7383        3.7424   -0.0041  0.0126
## good_country                       0.3561        0.3703   -0.0141  0.0000
##                             eQQ Mean eQQ Max
## distance                      0.0094  0.0144
## log(age)                      0.0194  0.1671
## male                          0.0141  1.0000
## log(friend_cnt + 1)           0.0442  0.4055
## log(avg_friend_age + 1)       0.0205  0.2336
## log(avg_friend_male + 1)      0.0056  0.0760
## log(friend_country_cnt + 1)   0.0327  0.6931
## log(songsListened + 1)        0.0988  2.5649
## log(lovedTracks + 1)          0.0327  0.6931
## log(posts + 1)                0.0218  0.8694
## log(playlists + 1)            0.0199  0.6931
## log(shouts + 1)               0.0355  0.6931
## log(tenure + 1)               0.0164  0.2513
## good_country                  0.0141  1.0000
## 
## , , Subclass 3
## 
##                             Means Treated Means Control Mean Diff eQQ Med
## distance                           0.4145        0.4048    0.0096  0.0103
## log(age)                           3.2024        3.2001    0.0022  0.0000
## male                               0.6512        0.6352    0.0160  0.0000
## log(friend_cnt + 1)                3.1062        3.0719    0.0343  0.0465
## log(avg_friend_age + 1)            3.2638        3.2646   -0.0008  0.0203
## log(avg_friend_male + 1)           0.4812        0.4860   -0.0048  0.0036
## log(friend_country_cnt + 1)        1.7901        1.7722    0.0179  0.0000
## log(songsListened + 1)             9.6568        9.6260    0.0309  0.0178
## log(lovedTracks + 1)               3.7538        3.7950   -0.0412  0.0343
## log(posts + 1)                     0.6524        0.6417    0.0107  0.0000
## log(playlists + 1)                 0.4014        0.4104   -0.0090  0.0000
## log(shouts + 1)                    2.7325        2.7427   -0.0102  0.0408
## log(tenure + 1)                    3.7625        3.7453    0.0172  0.0000
## good_country                       0.3470        0.3370    0.0100  0.0000
##                             eQQ Mean eQQ Max
## distance                      0.0097  0.0170
## log(age)                      0.0222  0.2624
## male                          0.0165  1.0000
## log(friend_cnt + 1)           0.0641  0.4055
## log(avg_friend_age + 1)       0.0251  0.1974
## log(avg_friend_male + 1)      0.0053  0.0572
## log(friend_country_cnt + 1)   0.0433  0.6931
## log(songsListened + 1)        0.0476  1.9459
## log(lovedTracks + 1)          0.0777  0.6931
## log(posts + 1)                0.0176  0.6931
## log(playlists + 1)            0.0208  0.6931
## log(shouts + 1)               0.0589  0.6931
## log(tenure + 1)               0.0257  1.6094
## good_country                  0.0104  1.0000
## 
## , , Subclass 4
## 
##                             Means Treated Means Control Mean Diff eQQ Med
## distance                           0.5774        0.5681    0.0092  0.0098
## log(age)                           3.2074        3.2117   -0.0044  0.0339
## male                               0.6420        0.6427   -0.0007  0.0000
## log(friend_cnt + 1)                3.4863        3.4497    0.0366  0.0385
## log(avg_friend_age + 1)            3.2712        3.2716   -0.0004  0.0287
## log(avg_friend_male + 1)           0.4885        0.4772    0.0114  0.0109
## log(friend_country_cnt + 1)        2.1122        2.0989    0.0133  0.0000
## log(songsListened + 1)             9.8758        9.8732    0.0026  0.0187
## log(lovedTracks + 1)               4.2332        4.2037    0.0295  0.0744
## log(posts + 1)                     0.9157        0.8223    0.0933  0.0000
## log(playlists + 1)                 0.4235        0.4083    0.0152  0.0000
## log(shouts + 1)                    3.1661        3.0812    0.0848  0.0690
## log(tenure + 1)                    3.7982        3.7593    0.0389  0.0308
## good_country                       0.3238        0.3213    0.0024  0.0000
##                             eQQ Mean eQQ Max
## distance                      0.0092  0.0179
## log(age)                      0.0347  0.4055
## male                          0.0008  1.0000
## log(friend_cnt + 1)           0.0767  0.8473
## log(avg_friend_age + 1)       0.0356  0.3167
## log(avg_friend_male + 1)      0.0131  0.0796
## log(friend_country_cnt + 1)   0.0501  0.6931
## log(songsListened + 1)        0.0423  0.8708
## log(lovedTracks + 1)          0.0936  0.8702
## log(posts + 1)                0.0927  0.6931
## log(playlists + 1)            0.0147  0.6931
## log(shouts + 1)               0.0827  0.6931
## log(tenure + 1)               0.0380  0.3185
## good_country                  0.0024  1.0000
## 
## , , Subclass 5
## 
##                             Means Treated Means Control Mean Diff eQQ Med
## distance                           0.7421        0.7291    0.0130  0.0147
## log(age)                           3.2141        3.1921    0.0220  0.0392
## male                               0.6115        0.6030    0.0085  0.0000
## log(friend_cnt + 1)                3.9658        3.9556    0.0102  0.0308
## log(avg_friend_age + 1)            3.2691        3.2461    0.0230  0.0392
## log(avg_friend_male + 1)           0.4820        0.4703    0.0117  0.0140
## log(friend_country_cnt + 1)        2.4754        2.5116   -0.0362  0.0426
## log(songsListened + 1)            10.1379       10.0651    0.0728  0.0616
## log(lovedTracks + 1)               4.6110        4.5637    0.0473  0.1342
## log(posts + 1)                     1.1444        1.2230   -0.0786  0.0000
## log(playlists + 1)                 0.4371        0.4315    0.0056  0.0000
## log(shouts + 1)                    3.7873        3.7217    0.0656  0.0392
## log(tenure + 1)                    3.7989        3.7379    0.0609  0.0551
## good_country                       0.3445        0.2990    0.0455  0.0000
##                             eQQ Mean eQQ Max
## distance                      0.0130  0.0196
## log(age)                      0.0301  0.3365
## male                          0.0083  1.0000
## log(friend_cnt + 1)           0.0481  0.4055
## log(avg_friend_age + 1)       0.0339  0.1603
## log(avg_friend_male + 1)      0.0125  0.0509
## log(friend_country_cnt + 1)   0.0435  0.4055
## log(songsListened + 1)        0.0979  3.2581
## log(lovedTracks + 1)          0.1446  1.0986
## log(posts + 1)                0.1087  1.4976
## log(playlists + 1)            0.0187  0.8910
## log(shouts + 1)               0.0819  1.1214
## log(tenure + 1)               0.0595  0.2683
## good_country                  0.0449  1.0000
## 
## , , Subclass 6
## 
##                             Means Treated Means Control Mean Diff eQQ Med
## distance                           0.9058        0.8699    0.0359  0.0376
## log(age)                           3.2550        3.1733    0.0817  0.0834
## male                               0.6215        0.6752   -0.0537  0.0000
## log(friend_cnt + 1)                4.8071        4.5569    0.2502  0.1363
## log(avg_friend_age + 1)            3.2785        3.2334    0.0451  0.0490
## log(avg_friend_male + 1)           0.4910        0.4813    0.0097  0.0085
## log(friend_country_cnt + 1)        3.0998        2.9599    0.1399  0.1092
## log(songsListened + 1)            10.3301       10.3882   -0.0581  0.0674
## log(lovedTracks + 1)               5.2952        5.2018    0.0934  0.2087
## log(posts + 1)                     1.6209        1.6282   -0.0073  0.0000
## log(playlists + 1)                 0.5132        0.4153    0.0979  0.0000
## log(shouts + 1)                    4.6137        4.4246    0.1892  0.1565
## log(tenure + 1)                    3.7482        3.7621   -0.0139  0.0339
## good_country                       0.3352        0.3077    0.0275  0.0000
##                             eQQ Mean eQQ Max
## distance                      0.0360  0.0657
## log(age)                      0.0862  0.3185
## male                          0.0513  1.0000
## log(friend_cnt + 1)           0.2706  2.8447
## log(avg_friend_age + 1)       0.0522  0.2537
## log(avg_friend_male + 1)      0.0105  0.0521
## log(friend_country_cnt + 1)   0.1396  1.1823
## log(songsListened + 1)        0.1566  5.8493
## log(lovedTracks + 1)          0.2823  1.7918
## log(posts + 1)                0.1632  1.4898
## log(playlists + 1)            0.1159  3.3928
## log(shouts + 1)               0.2248  2.5425
## log(tenure + 1)               0.0423  0.3365
## good_country                  0.0256  1.0000
## 
## 
## Sample sizes by subclasses:
##         Subclass 1 Subclass 2 Subclass 3 Subclass 4 Subclass 5 Subclass 6
## Treated       1637       1637       1637       1637       1637       1638
## Control      24527       5026       2481       1251        602        117
## Total        26164       6663       4118       2888       2239       1755
## 
## Summary of balance across subclasses
##                             Means Treated Means Control Mean Diff eQQ Med
## distance                           0.4982        0.4792    0.0092  0.0207
## log(age)                           3.2014        3.1803    0.0150  0.0326
## male                               0.6363        0.6389    0.0109  0.0000
## log(friend_cnt + 1)                3.3261        3.2028    0.0757  0.1096
## log(avg_friend_age + 1)            3.2562        3.2387    0.0106  0.0319
## log(avg_friend_male + 1)           0.4814        0.4732    0.0054  0.0092
## log(friend_country_cnt + 1)        1.9993        1.9465    0.0359  0.0253
## log(songsListened + 1)             9.6020        9.4315    0.1484  0.1535
## log(lovedTracks + 1)               3.9582        3.8423    0.0946  0.1789
## log(posts + 1)                     0.8391        0.8226    0.0237  0.0000
## log(playlists + 1)                 0.4130        0.3921    0.0167  0.0000
## log(shouts + 1)                    3.0357        2.9372    0.0581  0.1070
## log(tenure + 1)                    3.7453        3.7278    0.0127  0.0221
## good_country                       0.3433        0.3320    0.0093  0.0000
##                             eQQ Mean eQQ Max
## distance                      0.0190  0.0311
## log(age)                      0.0373  0.2998
## male                          0.0199  1.0000
## log(friend_cnt + 1)           0.1463  0.9338
## log(avg_friend_age + 1)       0.0345  0.2765
## log(avg_friend_male + 1)      0.0156  0.1006
## log(friend_country_cnt + 1)   0.0775  0.7268
## log(songsListened + 1)        0.2207  2.9875
## log(lovedTracks + 1)          0.1976  1.0889
## log(posts + 1)                0.0795  1.2831
## log(playlists + 1)            0.0340  1.4269
## log(shouts + 1)               0.1258  1.0998
## log(tenure + 1)               0.0324  0.6167
## good_country                  0.0168  1.0000
## 
## Percent Balance Improvement:
##                             Mean Diff. eQQ Med  eQQ Mean  eQQ Max
## distance                       94.6247 94.6950   94.6203  94.3445
## log(age)                       66.1718 42.9202   40.5834 -64.4371
## male                           64.5644  0.0000 -168.3629   0.0000
## log(friend_cnt + 1)            91.1270 92.4691   89.4753  68.6580
## log(avg_friend_age + 1)        73.0777 51.2211   47.0466  44.4094
## log(avg_friend_male + 1)       69.9300 80.2584   76.3880  67.5745
## log(friend_country_cnt + 1)    93.9768 97.4988   91.1556  44.0635
## log(songsListened + 1)         89.7115 88.0812   86.6907  50.4423
## log(lovedTracks + 1)           92.3338 88.3170   86.9274  61.5657
## log(posts + 1)                 97.0081  0.0000   85.6106  44.7252
## log(playlists + 1)             72.6598  0.0000   55.6367 -29.8861
## log(shouts + 1)                92.5414 92.8857   90.4645  51.8639
## log(tenure + 1)                80.7125 71.3450   64.3242  61.6837
## good_country                    1.0151  0.0000  -47.6303   0.0000
#plot(mod_match2)

dta_m2 <- match.data(mod_match2)
dta_m2 <- filter(dta_m2, subclass == 2)

#create a TableOne to see Standardized Mean Difference
match_table <- CreateTableOne(vars = xvars, strata = "treatment", data = dta_m2, test = FALSE)
print(match_table, smd = TRUE)
##                                 Stratified by treatment
##                                  0                   1                  
##   n                                  5026                1637           
##   age (mean (SD))                   24.96 (6.79)        24.72 (6.02)    
##   male (mean (SD))                   0.65 (0.48)         0.64 (0.48)    
##   friend_cnt (mean (SD))            15.55 (8.93)        15.81 (8.79)    
##   avg_friend_age (mean (SD))        25.33 (5.70)        25.17 (4.83)    
##   avg_friend_male (mean (SD))        0.64 (0.24)         0.63 (0.24)    
##   friend_country_cnt (mean (SD))     3.67 (2.22)         3.75 (2.17)    
##   songsListened (mean (SD))      23405.49 (26688.87) 24706.95 (28220.07)
##   lovedTracks (mean (SD))           98.99 (200.87)     103.87 (243.31)  
##   posts (mean (SD))                  3.55 (29.95)        3.43 (23.22)   
##   playlists (mean (SD))              0.58 (0.76)         0.63 (1.13)    
##   shouts (mean (SD))                24.26 (69.94)       26.55 (84.56)   
##   tenure (mean (SD))                46.32 (19.65)       46.42 (19.99)   
##   good_country (mean (SD))           0.37 (0.48)         0.36 (0.48)    
##                                 Stratified by treatment
##                                  SMD   
##   n                                    
##   age (mean (SD))                 0.039
##   male (mean (SD))                0.030
##   friend_cnt (mean (SD))          0.029
##   avg_friend_age (mean (SD))      0.029
##   avg_friend_male (mean (SD))     0.024
##   friend_country_cnt (mean (SD))  0.036
##   songsListened (mean (SD))       0.047
##   lovedTracks (mean (SD))         0.022
##   posts (mean (SD))               0.004
##   playlists (mean (SD))           0.052
##   shouts (mean (SD))              0.029
##   tenure (mean (SD))              0.005
##   good_country (mean (SD))        0.029
#alternative way to see mean difference
df_colnames <- names(dta_m2)[c(-1, -8, -14, -17, -18, -19)]
dta_m2 %>% 
  group_by(treatment) %>%
  select(one_of(df_colnames)) %>%
  summarise_all(funs(mean))
## Adding missing grouping variables: `treatment`
## # A tibble: 2 x 15
##   treatment   age  male friend_cnt avg_friend_age avg_friend_male
##       <dbl> <dbl> <dbl>      <dbl>          <dbl>           <dbl>
## 1         0  25.0 0.655       15.5           25.3           0.635
## 2         1  24.7 0.640       15.8           25.2           0.629
## # … with 9 more variables: friend_country_cnt <dbl>, songsListened <dbl>,
## #   lovedTracks <dbl>, posts <dbl>, playlists <dbl>, shouts <dbl>,
## #   tenure <dbl>, good_country <dbl>, subclass <dbl>
#t=test
output = NULL
for (i in 1:13){
  name = df_colnames[i]
  p_value = t.test(dta_m2[, df_colnames[i]] ~ dta_m2$treatment)$p.value
  output <- rbind(output, data.frame(name, p_value))
}
print(output) 
##                  name    p_value
## 1                 age 0.16059481
## 2                male 0.29083550
## 3          friend_cnt 0.30200462
## 4      avg_friend_age 0.28573427
## 5     avg_friend_male 0.39390220
## 6  friend_country_cnt 0.20644393
## 7       songsListened 0.10070373
## 8         lovedTracks 0.46305805
## 9               posts 0.87305044
## 10          playlists 0.09660396
## 11             shouts 0.32230035
## 12             tenure 0.87002629
## 13       good_country 0.30081295

In terms of t-test, all p-value in the subclass2 are not statistically significant (>0.05). So, there is no different between treatment and control group, which means that the observed covariates of two groups are balance.

Visual Inspection: most of points are located near the trend line

#Examining covariate balance in the matched sample
grid.arrange(
  fn_bal(dta_m2, "age"),
  fn_bal(dta_m2, "male") + theme(legend.position = "none"),
  fn_bal(dta_m2, "friend_cnt"),
  fn_bal(dta_m2, "avg_friend_age") + theme(legend.position = "none"),
  fn_bal(dta_m2, "avg_friend_male"),
  fn_bal(dta_m2, "friend_country_cnt") + theme(legend.position = "none"),
  fn_bal(dta_m2, "subscriber_friend_cnt"),
  fn_bal(dta_m2, "songsListened") + theme(legend.position = "none"),
  nrow = 4, widths = c(10, 8)
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

grid.arrange(
  fn_bal(dta_m2, "lovedTracks"),
  fn_bal(dta_m2, "posts") + theme(legend.position = "none"),
  fn_bal(dta_m2, "playlists"),
  fn_bal(dta_m2, "shouts") + theme(legend.position = "none"),
  fn_bal(dta_m2, "adopter"),
  fn_bal(dta_m2, "tenure") + theme(legend.position = "none"),
  fn_bal(dta_m2, "good_country"),
  nrow = 4, widths = c(10, 8)
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

#put log data back to df
#for (col in df_colnames) {
 # tmp = mutate(dta_m, col = log(dta_m[,col]))
#}
dta_m = mutate(dta_m, age_log = log(dta_m[,'age']+1))
dta_m = mutate(dta_m, male_new = dta_m[,'male'])
dta_m = mutate(dta_m, friend_cnt_log = log(dta_m[,'friend_cnt'])+1)
dta_m = mutate(dta_m, avg_friend_age_log = log(dta_m[,'avg_friend_age']+1))
dta_m = mutate(dta_m, avg_friend_male_log = log(dta_m[,'avg_friend_male']+1))
dta_m = mutate(dta_m, friend_country_cnt_log = log(dta_m[,'friend_country_cnt']+1))
dta_m = mutate(dta_m, songsListened_log = log(dta_m[,'songsListened']+1))
dta_m = mutate(dta_m, lovedTracks_log = log(dta_m[,'lovedTracks']+1))
dta_m = mutate(dta_m, posts_log = log(dta_m[,'posts']+1))
dta_m = mutate(dta_m, playlists_log = log(dta_m[,'playlists']+1))
dta_m = mutate(dta_m, shouts_log = log(dta_m[,'shouts']+1))
dta_m = mutate(dta_m, tenure_log = log(dta_m[,'tenure']+1))
dta_m = mutate(dta_m, good_country_new = dta_m[,'good_country'])


dta_m2 = mutate(dta_m2, age_log = log(dta_m2[,'age']+1))
dta_m2 = mutate(dta_m2, male_new = dta_m2[,'male'])
dta_m2 = mutate(dta_m2, friend_cnt_log = log(dta_m2[,'friend_cnt'])+1)
dta_m2 = mutate(dta_m2, avg_friend_age_log = log(dta_m2[,'avg_friend_age']+1))
dta_m2 = mutate(dta_m2, avg_friend_male_log = log(dta_m2[,'avg_friend_male']+1))
dta_m2 = mutate(dta_m2, friend_country_cnt_log = log(dta_m2[,'friend_country_cnt']+1))
dta_m2 = mutate(dta_m2, songsListened_log = log(dta_m2[,'songsListened']+1))
dta_m2 = mutate(dta_m2, lovedTracks_log = log(dta_m2[,'lovedTracks']+1))
dta_m2 = mutate(dta_m2, posts_log = log(dta_m2[,'posts']+1))
dta_m2 = mutate(dta_m2, playlists_log = log(dta_m2[,'playlists']+1))
dta_m2 = mutate(dta_m2, shouts_log = log(dta_m2[,'shouts']+1))
dta_m2 = mutate(dta_m2, tenure_log = log(dta_m2[,'tenure']+1))
dta_m2 = mutate(dta_m2, good_country_new = dta_m2[,'good_country'])
#mean differences of two natching methods 
cols <- colnames(dta_m)[20:32]

#1
match_table <- CreateTableOne(vars = cols, strata = "treatment", data = dta_m, test = FALSE)
print(match_table, smd = TRUE)
##                                     Stratified by treatment
##                                      0           1           SMD   
##   n                                  9823        9823              
##   age_log (mean (SD))                3.23 (0.24) 3.24 (0.24)  0.053
##   male_new (mean (SD))               0.64 (0.48) 0.64 (0.48)  0.013
##   friend_cnt_log (mean (SD))         3.76 (0.95) 4.25 (1.19)  0.460
##   avg_friend_age_log (mean (SD))     3.25 (0.20) 3.26 (0.18)  0.027
##   avg_friend_male_log (mean (SD))    0.48 (0.16) 0.48 (0.15)  0.003
##   friend_country_cnt_log (mean (SD)) 1.64 (0.63) 2.00 (0.81)  0.498
##   songsListened_log (mean (SD))      9.36 (1.76) 9.60 (1.75)  0.135
##   lovedTracks_log (mean (SD))        3.53 (1.92) 3.96 (2.05)  0.216
##   posts_log (mean (SD))              0.58 (1.07) 0.84 (1.36)  0.208
##   playlists_log (mean (SD))          0.39 (0.41) 0.41 (0.46)  0.054
##   shouts_log (mean (SD))             2.52 (1.43) 3.04 (1.69)  0.326
##   tenure_log (mean (SD))             3.73 (0.55) 3.75 (0.53)  0.024
##   good_country_new (mean (SD))       0.35 (0.48) 0.34 (0.47)  0.012
#
#dta_m %>%
#  group_by(treatment) %>%
#  select(one_of(cols)) %>%
#  summarise_all(funs(mean))

#2
match_table <- CreateTableOne(vars = cols, strata = "treatment", data = dta_m2, test = FALSE)
print(match_table, smd = TRUE)
##                                     Stratified by treatment
##                                      0           1           SMD   
##   n                                  5026        1637              
##   age_log (mean (SD))                3.23 (0.23) 3.22 (0.21)  0.020
##   male_new (mean (SD))               0.65 (0.48) 0.64 (0.48)  0.030
##   friend_cnt_log (mean (SD))         3.55 (0.69) 3.59 (0.63)  0.064
##   avg_friend_age_log (mean (SD))     3.25 (0.19) 3.25 (0.17)  0.005
##   avg_friend_male_log (mean (SD))    0.48 (0.15) 0.48 (0.15)  0.025
##   friend_country_cnt_log (mean (SD)) 1.44 (0.46) 1.46 (0.44)  0.058
##   songsListened_log (mean (SD))      9.30 (1.67) 9.40 (1.58)  0.059
##   lovedTracks_log (mean (SD))        3.31 (1.86) 3.32 (1.88)  0.006
##   posts_log (mean (SD))              0.46 (0.92) 0.47 (0.91)  0.010
##   playlists_log (mean (SD))          0.37 (0.40) 0.38 (0.43)  0.011
##   shouts_log (mean (SD))             2.29 (1.27) 2.28 (1.30)  0.009
##   tenure_log (mean (SD))             3.74 (0.53) 3.74 (0.55)  0.008
##   good_country_new (mean (SD))       0.37 (0.48) 0.36 (0.48)  0.029

treatment effect Finally, I used logistic regression to double check the result of the models to see if the treatment and control samples got balanced.

#t-test
output = NULL
for (i in 1:13){
  name = cols[i]
  p_value = t.test(dta_m[, cols[i]] ~ dta_m$treatment)$p.value
  output <- rbind(output, data.frame(name, p_value))
}
print(output) #all significant
##                      name       p_value
## 1                 age_log  1.979079e-04
## 2                male_new  3.569646e-01
## 3          friend_cnt_log 7.789995e-222
## 4      avg_friend_age_log  5.738469e-02
## 5     avg_friend_male_log  8.501290e-01
## 6  friend_country_cnt_log 4.196158e-258
## 7       songsListened_log  3.106254e-21
## 8         lovedTracks_log  1.843262e-51
## 9               posts_log  9.008717e-48
## 10          playlists_log  1.459077e-04
## 11             shouts_log 3.699910e-114
## 12             tenure_log  9.003016e-02
## 13       good_country_new  4.180329e-01
#treatment effect
lm_test1 <- glm(adopter~treatment, data = dta_m, family = binomial(), )
summary(lm_test1)
## 
## Call:
## glm(formula = adopter ~ treatment, family = binomial(), data = dta_m)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6252  -0.6252  -0.4376  -0.4376   2.1879  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.29767    0.03503  -65.60   <2e-16 ***
## treatment    0.76458    0.04386   17.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15509  on 19645  degrees of freedom
## Residual deviance: 15191  on 19644  degrees of freedom
## AIC: 15195
## 
## Number of Fisher Scoring iterations: 5
lm_test2 <- glm(adopter~treatment + age_log + male_new + friend_cnt_log + avg_friend_age_log
                + avg_friend_male_log + friend_country_cnt_log + songsListened_log + lovedTracks_log 
                + posts_log + playlists_log + shouts_log + tenure_log, data = dta_m, family = binomial())
summary(lm_test2)
## 
## Call:
## glm(formula = adopter ~ treatment + age_log + male_new + friend_cnt_log + 
##     avg_friend_age_log + avg_friend_male_log + friend_country_cnt_log + 
##     songsListened_log + lovedTracks_log + posts_log + playlists_log + 
##     shouts_log + tenure_log, family = binomial(), data = dta_m)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6856  -0.5752  -0.4304  -0.2971   2.9661  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -9.242989   0.501279 -18.439  < 2e-16 ***
## treatment               0.578626   0.047585  12.160  < 2e-16 ***
## age_log                 0.608481   0.148409   4.100 4.13e-05 ***
## male_new                0.261536   0.049733   5.259 1.45e-07 ***
## friend_cnt_log          0.070675   0.038618   1.830  0.06723 .  
## avg_friend_age_log      0.873225   0.201824   4.327 1.51e-05 ***
## avg_friend_male_log     0.182963   0.156736   1.167  0.24308    
## friend_country_cnt_log  0.006577   0.048278   0.136  0.89164    
## songsListened_log       0.204511   0.019736  10.362  < 2e-16 ***
## lovedTracks_log         0.251007   0.013486  18.613  < 2e-16 ***
## posts_log               0.128037   0.018053   7.092 1.32e-12 ***
## playlists_log           0.154962   0.048393   3.202  0.00136 ** 
## shouts_log             -0.085773   0.019328  -4.438 9.09e-06 ***
## tenure_log             -0.349142   0.048227  -7.240 4.50e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15509  on 19645  degrees of freedom
## Residual deviance: 14115  on 19632  degrees of freedom
## AIC: 14143
## 
## Number of Fisher Scoring iterations: 5
#treatment effect
lm_test3 <- glm(adopter~treatment, data = dta_m2, family = binomial(), )
summary(lm_test3)
## 
## Call:
## glm(formula = adopter ~ treatment, family = binomial(), data = dta_m2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5682  -0.4025  -0.4025  -0.4025   2.2599  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.47268    0.05266 -46.954   <2e-16 ***
## treatment    0.73064    0.08712   8.387   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4192.9  on 6662  degrees of freedom
## Residual deviance: 4126.3  on 6661  degrees of freedom
## AIC: 4130.3
## 
## Number of Fisher Scoring iterations: 5
lm_test4 <- glm(adopter~treatment + age_log + male_new + friend_cnt_log + avg_friend_age_log
                + avg_friend_male_log + friend_country_cnt_log + songsListened_log + lovedTracks_log 
                + posts_log + playlists_log + shouts_log + tenure_log, data = dta_m2, family = binomial())
summary(lm_test4)
## 
## Call:
## glm(formula = adopter ~ treatment + age_log + male_new + friend_cnt_log + 
##     avg_friend_age_log + avg_friend_male_log + friend_country_cnt_log + 
##     songsListened_log + lovedTracks_log + posts_log + playlists_log + 
##     shouts_log + tenure_log, family = binomial(), data = dta_m2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5171  -0.4840  -0.3640  -0.2601   2.9942  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -17.06130    3.18160  -5.362 8.21e-08 ***
## treatment                0.71601    0.09032   7.927 2.24e-15 ***
## age_log                  0.92899    0.32490   2.859 0.004245 ** 
## male_new                 0.27716    0.10186   2.721 0.006507 ** 
## friend_cnt_log           0.55987    0.18248   3.068 0.002154 ** 
## avg_friend_age_log       2.42208    0.73048   3.316 0.000914 ***
## avg_friend_male_log      0.50759    0.29989   1.693 0.090532 .  
## friend_country_cnt_log   0.23115    0.13531   1.708 0.087571 .  
## songsListened_log        0.22339    0.03900   5.728 1.01e-08 ***
## lovedTracks_log          0.33988    0.03015  11.272  < 2e-16 ***
## posts_log                0.13697    0.05079   2.697 0.006999 ** 
## playlists_log            0.29315    0.10631   2.757 0.005827 ** 
## shouts_log              -0.21240    0.04138  -5.133 2.86e-07 ***
## tenure_log              -0.57488    0.11308  -5.084 3.70e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4192.9  on 6662  degrees of freedom
## Residual deviance: 3832.6  on 6649  degrees of freedom
## AIC: 3860.6
## 
## Number of Fisher Scoring iterations: 6

It shows that when only looking to the relationship between adopter and treatment, the p-value is significant. The coefficient of treatment variable is 0.73064, positively correlated with the adopter. However, if I put all the other variables (demographic variables, peer influence, user engagement variables) into logistic regression, the coefficient of treatment is still 0.71 or so, close to 0.73. Therefore, it means that whether or not I put other variables, the correlation of adopter and treatment is not affected. The treatment effect is solved.

#4. Regression Analyses #logistic regression

#put all variables into the regression
lm1<- lm(adopter~  log(age) + male +log(friend_cnt+1) + log(avg_friend_age+1)  + log(avg_friend_male+1) +
                       log(friend_country_cnt+1) + log(subscriber_friend_cnt + 1) + log(songsListened+1) + log(lovedTracks+1) + log(posts+1) +log(playlists+1) + log(shouts+1) + log(tenure+1) + good_country, data = df, family = binomial())
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'family' will be disregarded
summary(lm1)
## 
## Call:
## lm(formula = adopter ~ log(age) + male + log(friend_cnt + 1) + 
##     log(avg_friend_age + 1) + log(avg_friend_male + 1) + log(friend_country_cnt + 
##     1) + log(subscriber_friend_cnt + 1) + log(songsListened + 
##     1) + log(lovedTracks + 1) + log(posts + 1) + log(playlists + 
##     1) + log(shouts + 1) + log(tenure + 1) + good_country, data = df, 
##     family = binomial())
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65446 -0.10987 -0.05571 -0.00430  1.06422 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -0.2652221  0.0232335 -11.416  < 2e-16 ***
## log(age)                        0.0625975  0.0079841   7.840 4.60e-15 ***
## male                            0.0225384  0.0026948   8.364  < 2e-16 ***
## log(friend_cnt + 1)             0.0005592  0.0023392   0.239   0.8111    
## log(avg_friend_age + 1)         0.0243932  0.0102037   2.391   0.0168 *  
## log(avg_friend_male + 1)        0.0031592  0.0060004   0.526   0.5985    
## log(friend_country_cnt + 1)    -0.0014172  0.0032037  -0.442   0.6582    
## log(subscriber_friend_cnt + 1)  0.0912123  0.0033791  26.993  < 2e-16 ***
## log(songsListened + 1)          0.0055781  0.0006097   9.150  < 2e-16 ***
## log(lovedTracks + 1)            0.0203240  0.0007199  28.232  < 2e-16 ***
## log(posts + 1)                  0.0161201  0.0015031  10.724  < 2e-16 ***
## log(playlists + 1)              0.0150537  0.0031679   4.752 2.02e-06 ***
## log(shouts + 1)                -0.0095469  0.0013196  -7.235 4.74e-13 ***
## log(tenure + 1)                -0.0138152  0.0025582  -5.400 6.68e-08 ***
## good_country                   -0.0294464  0.0026364 -11.169  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.259 on 43812 degrees of freedom
## Multiple R-squared:  0.09358,    Adjusted R-squared:  0.09329 
## F-statistic: 323.1 on 14 and 43812 DF,  p-value: < 2.2e-16

First, use all variables (continuous variables is taken log transformation) into logistic regression. Adopter is dependent variable. The summary result shows that friend_cnt, avg_friend_male_log and friend_country_cnt_log are not significant, based on the alpha = 0.05. So, I removed these two variables in the second logistic regression model.

In the second model, all variables are statistically significant.

#remove non-significant variables
lm2<- lm(adopter~ log(age) + male + log(avg_friend_age+1)  + log(subscriber_friend_cnt + 1) + log(songsListened+1) + log(lovedTracks+1) + log(posts+1) +log(playlists+1) + log(shouts+1) + log(tenure+1) + good_country, data = df, family = binomial())
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'family' will be disregarded
summary(lm2)
## 
## Call:
## lm(formula = adopter ~ log(age) + male + log(avg_friend_age + 
##     1) + log(subscriber_friend_cnt + 1) + log(songsListened + 
##     1) + log(lovedTracks + 1) + log(posts + 1) + log(playlists + 
##     1) + log(shouts + 1) + log(tenure + 1) + good_country, data = df, 
##     family = binomial())
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65458 -0.10992 -0.05563 -0.00434  1.06461 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -0.2659995  0.0228422 -11.645  < 2e-16 ***
## log(age)                        0.0621742  0.0079405   7.830 4.99e-15 ***
## male                            0.0226576  0.0026892   8.425  < 2e-16 ***
## log(avg_friend_age + 1)         0.0252542  0.0099806   2.530   0.0114 *  
## log(subscriber_friend_cnt + 1)  0.0909001  0.0030221  30.078  < 2e-16 ***
## log(songsListened + 1)          0.0056192  0.0005777   9.726  < 2e-16 ***
## log(lovedTracks + 1)            0.0202868  0.0007079  28.658  < 2e-16 ***
## log(posts + 1)                  0.0161080  0.0015013  10.730  < 2e-16 ***
## log(playlists + 1)              0.0149707  0.0031653   4.730 2.26e-06 ***
## log(shouts + 1)                -0.0096372  0.0011346  -8.494  < 2e-16 ***
## log(tenure + 1)                -0.0137723  0.0025563  -5.388 7.18e-08 ***
## good_country                   -0.0294245  0.0026313 -11.183  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.259 on 43815 degrees of freedom
## Multiple R-squared:  0.09357,    Adjusted R-squared:  0.09334 
## F-statistic: 411.2 on 11 and 43815 DF,  p-value: < 2.2e-16
#odd ratio
exp(lm2$coefficients)
##                    (Intercept)                       log(age) 
##                      0.7664395                      1.0641477 
##                           male        log(avg_friend_age + 1) 
##                      1.0229162                      1.0255758 
## log(subscriber_friend_cnt + 1)         log(songsListened + 1) 
##                      1.0951596                      1.0056350 
##           log(lovedTracks + 1)                 log(posts + 1) 
##                      1.0204940                      1.0162385 
##             log(playlists + 1)                log(shouts + 1) 
##                      1.0150833                      0.9904091 
##                log(tenure + 1)                   good_country 
##                      0.9863221                      0.9710042

Based on ANOVA, the model 2 is better for simplicity.

#ANOVA
anova(lm1, lm2, test="Chisq")
## Analysis of Variance Table
## 
## Model 1: adopter ~ log(age) + male + log(friend_cnt + 1) + log(avg_friend_age + 
##     1) + log(avg_friend_male + 1) + log(friend_country_cnt + 
##     1) + log(subscriber_friend_cnt + 1) + log(songsListened + 
##     1) + log(lovedTracks + 1) + log(posts + 1) + log(playlists + 
##     1) + log(shouts + 1) + log(tenure + 1) + good_country
## Model 2: adopter ~ log(age) + male + log(avg_friend_age + 1) + log(subscriber_friend_cnt + 
##     1) + log(songsListened + 1) + log(lovedTracks + 1) + log(posts + 
##     1) + log(playlists + 1) + log(shouts + 1) + log(tenure + 
##     1) + good_country
##   Res.Df    RSS Df Sum of Sq Pr(>Chi)
## 1  43812 2939.7                      
## 2  43815 2939.7 -3 -0.033112   0.9203
#multicollinearity
library(car)
## Warning: package 'car' was built under R version 3.5.2
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
## 
##     some
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(lm2)
##                       log(age)                           male 
##                       2.299142                       1.100530 
##        log(avg_friend_age + 1) log(subscriber_friend_cnt + 1) 
##                       2.234923                       1.360421 
##         log(songsListened + 1)           log(lovedTracks + 1) 
##                       1.488033                       1.431314 
##                 log(posts + 1)             log(playlists + 1) 
##                       1.351067                       1.109004 
##                log(shouts + 1)                log(tenure + 1) 
##                       1.793807                       1.372650 
##                   good_country 
##                       1.031778

The AUC of ROC Curve is 0.791, which is great.

#ROC
library(pROC)
## Warning: package 'pROC' was built under R version 3.5.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
g = roc(df$adopter ~ predict(lm2, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(g, print.thres = "best" ,print.auc=TRUE)

#find the best threshold
best_thres = coords(g, "best", ret=c("threshold", "specificity", "sensitivity"))
## Warning in coords.roc(g, "best", ret = c("threshold", "specificity",
## "sensitivity")): The 'transpose' argument to FALSE by default since pROC
## 1.16. Set transpose = TRUE explicitly to revert to the previous behavior,
## or transpose = TRUE to silence this warning. Type help(coords_transpose)
## for additional information.
print(best_thres)
##    threshold specificity sensitivity
## 1 0.09444048   0.6613896   0.7839524
#for 1 unit increase in log(age), adopter will increase by the factor of exp()

Interpretation: For every 1 unit increase in age, adopter will increase by a factor of 1.064. For every 1 unit increase in male, adopter will increase by a factor of 1.023. For every 1 unit increase in average friend’s age (avg_freind_age), adopter will increase by a factor of 1.026. For every 1 unit increase in subscriber friend count (subscriber_friend_cnt), adopter will increase by a factor of 1.095. For every 1 unit increase in songListened, adopter will increase by a factor of 1.006. For every 1 unit increase in loveTracked, adopter will increase by a factor of 1.020. For every 1 unit increase in posts, adopter will increase by a factor of 1.016. For every 1 unit increase in playlists, adopter will increase by a factor of 1.015. For every 1 unit increase in shouts, adopter will decrease by a factor of 1.020. For every 1 unit increase in tenure, adopter will decrease by a factor of 0.986. For every 1 unit increase in good country, adopter will decrease by a factor of 0.971.

To sum up, the age, male, friend average age, subscriber friend count, song listened, love tracked, posts, and playlists are the variables which has positive correlation with adopter. The variables of shouts, tenure, good country are negatively correlated with adopter, which means that these the higher three variables are, it is less likely to become premium users (adopter = 1).

#5. Free-to-Fee strategy for HighNote According to the descriptive statistics and the models we built in the previous sections, it shows that, for example, the demographic variables, such as male and age, are positive related to the adopter. If a user who is a male with elder age, then he might be the target users for HighNote. So, it might be a good marketing strategy for HighNote to segment their users and find the aftermentioned target users (elder men) to do more advertising or personalized marketing.

For the peer influence variables, if a user actively interacts with lots of subscriber friends whose age belongs to elder age, the user is more likely to continuously engage on HighNote and become the target premium users. So, the company could create more activities among users, such as community or word-of-mouth marketing, to increase peer influence and further convert free users to premium users.

Regarding the user engagement variables, the company could create more interactions between company and users, such as public relation campaigns about sharing listened songs, loved tracks or playlists. These might help engaged users and keep using on this platform. Eventually, they will become premium users.

Last but not the least, High Note should revamp several functions or services on platform, such as shouts or tenure, to attract users. Also, it is better for High Note to expand its oversea market, not just focused on US, UK, Germany since based on the model result, many premium users are from other countries outside US, UK, Germany and so forth. Therefore, it is a good chance to explore new market and reach out to the target users.